https://spark.apache.org/docs/latest/
https://spark.apache.org/docs/latest/api/python/index.html
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
https://spark.apache.org/docs/latest/spark-standalone.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
https://data-flair.training/blogs/spark-tutorial/
https://www.digitalocean.com/community/tutorials/an-introduction-to-mesosphere
https://github.com/shathil/BigDataExercises
From the above URL, download the Git repository as a zip file. From the downloaded zip package, extract the folder called resources and place it inside your Spark 2.x folder.
Hint: a flatMap-ed RDD might not behave like one produced by map or filter. Calling an action on such a flatMap-ed RDD directly will throw an error. Can you guess why? If you have already run into this and solved it, congrats! A small sketch follows.
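One way the hint can play out (a minimal sketch with made-up numbers, assuming the shell-provided sc): because RDDs are evaluated lazily, a bad function passed to flatMap only fails once an action forces the computation.
In [ ]:
# flatMap expects the function to return an iterable; because RDDs are
# lazy, the mistake surfaces only when an action runs, not at flatMap time.
nums = sc.parallelize([1, 2, 3])
nums.flatMap(lambda x: [x, x * 10]).collect()  # fine: [1, 10, 2, 20, 3, 30]
# nums.flatMap(lambda x: x * 10).collect()     # TypeError at action time: 'int' object is not iterable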
If your machine has less than 8 GB of RAM and other processes are running, stop before 1 million elements. Try this when no other processes are running.
In [ ]:
# combining RDDs of (key, value) pairs
rdd1 = sc.parallelize([("foo", 1), ("bar", 2), ("baz", 3)])
rdd2 = sc.parallelize([("foo", 4), ("bar", 5), ("bar", 6)])
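One possible way to combine these pair RDDs (a sketch; the exercise may intend a different operation) is join, which matches entries by key, or union, which simply concatenates:
In [ ]:
# join matches pairs by key and combines the values into a tuple;
# keys present in only one RDD (here 'baz') are dropped
rdd1.join(rdd2).collect()
# e.g. [('foo', (1, 4)), ('bar', (2, 5)), ('bar', (2, 6))]
# union concatenates both RDDs, keeping every pair
rdd1.union(rdd2).collect()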
In [ ]:
# combining RDDs of plain list items
words1 = sc.parallelize(["Hello", "Human"])
words2 = sc.parallelize(["world", "all", "you", "Mars"])
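For plain (non-pair) RDDs like these, union and cartesian are the usual ways to combine them (a sketch):
In [ ]:
# union concatenates the two word lists into one RDD
words1.union(words2).collect()   # ['Hello', 'Human', 'world', 'all', 'you', 'Mars']
# cartesian pairs every word in words1 with every word in words2
words1.cartesian(words2).collect()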
Think about how to display the words with the highest values in descending order.
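One approach (a sketch; the counts RDD here is hypothetical, standing in for whatever word counts your exercise produces):
In [ ]:
# hypothetical (word, count) pairs
counts = sc.parallelize([("spark", 7), ("rdd", 3), ("shell", 5)])
# full sort by value, descending
counts.sortBy(lambda kv: kv[1], ascending=False).collect()
# or take just the top n without materializing a full sort
counts.takeOrdered(2, key=lambda kv: -kv[1])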
To inspect the existing shell sc configuration, and to check the status of "spark.dynamicAllocation.enabled", see:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-dynamic-allocation.html
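A minimal sketch for both checks, using the standard SparkConf API on the shell-provided sc:
In [ ]:
# list all explicitly set properties of the shell's SparkContext
sc.getConf().getAll()
# read a single property, with a default if it is not set
sc.getConf().get("spark.dynamicAllocation.enabled", "false")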
To enable dynamic allocation via the configuration, see:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkConf.html
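A sketch of setting it when building a new context; note that a running shell sc cannot be reconfigured in place, so you would stop it first (or pass these options as flags when launching the shell):
In [ ]:
from pyspark import SparkConf, SparkContext

# dynamic allocation must be set before the context is created;
# it also requires the external shuffle service
conf = (SparkConf()
        .setAppName("dyn-alloc-demo")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true"))
# sc.stop()                    # stop the shell context first
# sc = SparkContext(conf=conf)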
So, did you get curious about why we set dynamic allocation in the config? Have you found out where it is useful?
https://docs.databricks.com/spark/latest/gentle-introduction/sparksession.html
For Python, https://github.com/apache/spark/tree/master/examples/src/main/python/sql
For Scala, https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/sql
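A minimal sketch of creating a SparkSession in Python (in the shell it is usually predefined as spark; getOrCreate returns the existing session if there is one):
In [ ]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-examples").getOrCreate()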
In [ ]:
## Question 13 - TODO Week 3
* Loading data from a text file
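A sketch of both loading styles, assuming a text file inside the resources folder extracted earlier (the file name here is hypothetical):
In [ ]:
# RDD API: each element is one line of the file
lines = sc.textFile("resources/sample.txt")   # hypothetical path
# DataFrame API: one string column named 'value'
df = spark.read.text("resources/sample.txt")  # hypothetical path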